Abstract

This paper presents a formal mathematical framework which unifies
the existing loop transformations. This framework also includes more
general classes of loop transformations, which can extract more
parallelism from a class of programs than the existing techniques. We
classify schedules into three classes: uniform, subdomain-variant, and
statement-variant. Viewed by the degree of parallelism gained by loop
transformation, the schedules can also be classified as
single-sequential level, multiple-sequential level, and mixed
schedules. We also illustrate the usefulness of the more general loop
transformations with an example program.

1  Introduction 

One of the central issues in restructuring compilers is to discover
parallelism automatically and generate correct parallel control
structures that can take advantage of a large number of processors.
The advent of massively parallel machines opens up opportunities for
programs that have large-scale parallelism to gain tremendous
performance over those that do not.

This paper presents a formal mathematical framework which unifies
the existing loop transformations such as loop interchanging [1, 2,
17, 19], permutation [3], skewing [17, 19], reversal, the wavefront
method [7, 9, 10, 11, 13, 14, 15], and statement reordering. This
framework also includes more general classes of loop transformations
which can extract more parallelism from a class of programs than the
existing techniques. The particular class of programs are those that
consist of perfectly nested loops, possibly with conditional
statements, where the guards as well as the array index expressions
are affine expressions of the loop indices.

In the next section, we describe the notation and terminology used
in the paper. In Section 3, we present a formal mathematical framework
which unifies the existing loop transformation techniques and sets
the stage for discussing the more general classes of loop
transformers. A loop transformer is a function that relates a given
loop nest with its transformed version, and consists of two parts: a
spatial morphism, and a temporal morphism, called a schedule. Next, in
Section 4, we classify schedules, by the property of uniformity, into
three classes: uniform, subdomain-variant, and statement-variant.
Viewed by the degree of parallelism gained by loop transformation,
the schedules can also be classified as single-sequential level,
multiple-sequential level, and mixed schedules. We also describe the
functional forms of the schedules for each class. Existing loop
transformation techniques are given as examples of these classes of
schedules.

Owing to limited space, we refer the reader to [12] for the
algorithms for obtaining the more general classes of schedules. The
problem formulations for obtaining these schedules are based on
dependence index pairs, which provide more dependence information
than dependence vectors. Since there are many such pairs that need to
be considered, and they can be infinitely many when the loop bounds
are unknown at compile time, we need to rely on a technique called
polyhedra decomposition [8, 15] to manage the complexity of the
algorithm. In addition, nonlinear programming and bounded enumerative
search are required to obtain optimal schedules. The complexity of
nonlinear programming is reduced by using fast heuristics and linear
programming as described in [12], which obtain optimal schedules for
most cases.

Finally, we illustrate the usefulness of the more general loop
transformations with an example program in Section 5. Versions of the
transformed program using different schedules are implemented on a
Connection Machine CM/2. The difference in performance, which is
essentially due to the available parallelism determined by the
schedule, can amount to two orders of magnitude.

2  Definitions and Terminology

Throughout this paper, programming examples are written in a
Fortran-like notation, although the transformation techniques also
apply to functional languages.

Index Domains  Let $[a, b]$ be an interval domain of integers from
$a$ to $b$. We define an index domain $D$ (also called an iteration
space in [17]) of a $d$-level perfectly nested loop

Loop Nest 1

    DO (i_1 = l_1, u_1) {
      DO (...) {
        DO (i_d = l_d, u_d) {
          body } } }

to be the Cartesian product $[l_1, u_1] \times \cdots \times [l_d, u_d]$
of $d$ interval domains $[l_k, u_k]$ for $1 \le k \le d$.

For the purpose of formulating loop transformations, we consider $D$
to be a subset of the $d$-dimensional vector space over rationals.
Throughout the paper, we let $I = (i_1, \ldots, i_d)$ and
$J = (j_1, \ldots, j_d)$. With the domain and tuple notations, Loop
Nest 1 can be rewritten as follows:

    DO (I:D) {
      body }

In this paper, we focus on sequential loop nests which are perfectly
nested. We use the following loop nest as a generic example throughout
the paper, where $D$ is a $d$-dimensional index domain and $\tau[a]$
is an expression containing $a$:

Loop Nest L (Generic Loop Nest)

    DO (I:D) {
      S_1: IF (P_1) A(X(I)) = ...
      S_2: IF (P_2) B(Z(I)) = τ[A(Y(I))]
    }

Data Dependence  We now define dependence between statements. Let
$S_1$ and $S_2$ be two statements of a program. A flow dependence
exists from $S_1$ to $S_2$ if $S_1$ writes data that can subsequently
be read by $S_2$. An anti-dependence exists from $S_1$ to $S_2$ if
$S_1$ reads data that $S_2$ can subsequently overwrite. An output
dependence exists from $S_1$ to $S_2$ if $S_1$ writes data that $S_2$
can subsequently overwrite. We use the notation $S_1 \Rightarrow S_2$
to denote a dependence from $S_1$ to $S_2$.

Consider Loop Nest L. For statement $S_2$ to compute the value
$B(Z(J))$ at iteration $J$, the value $A(Y(J))$ is needed. If
$A(Y(J))$ is computed by statement $S_1$ at iteration $I$, i.e.
$Y(J) = X(I)$, then we say $S_2$ at iteration $J$ is flow dependent
on $S_1$ at iteration $I$, denoted by $S_1@I \Rightarrow S_2@J$.
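The definition above can be made concrete with a small enumeration. The following Python sketch is only an illustration under stated assumptions: the function name `flow_dependences` and the 3-by-3 example nest are hypothetical, and only the access pattern is compared (intervening writes are ignored). It lists the index pairs $(I, J)$ with $X(I) = Y(J)$ in which the write at $I$ precedes or coincides with the read at $J$.

```python
from itertools import product

def flow_dependences(domain, X, Y):
    """Return all (I, J) such that S1@I => S2@J in the generic nest.

    S1 writes A(X(I)) and S2 reads A(Y(J)). Since S1 precedes S2 in the
    loop body, the write reaches the read when I is lexicographically
    before J or I == J. Python compares tuples lexicographically.
    """
    return [(I, J) for I in domain for J in domain
            if X(I) == Y(J) and I <= J]

# Hypothetical 2-level instance: S1 writes A(i, j), S2 reads A(i-1, j).
domain = list(product(range(1, 4), range(1, 4)))
X = lambda I: (I[0], I[1])
Y = lambda J: (J[0] - 1, J[1])

pairs = flow_dependences(domain, X, Y)
distances = {(J[0] - I[0], J[1] - I[1]) for I, J in pairs}
print(len(pairs), distances)
```

In this instance every dependence has the same distance vector; the index pairs carry strictly more information only when the distance varies across the domain.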

3  Formalizing Loop Transformation

We now formalize the notion of loop transformation from a source
loop nest to a target parallel loop nest. A loop transformer is a
function defined over the Cartesian product of the iteration space of
the loop nest and the set of statements in the body of the loop that
relates a given loop nest with its transformed version. From the
standpoint of symbolic transformation of the program text, a loop
transformer can be decomposed into two components: the first
component, called domain morphism, defines how the iteration space
should be mapped to a new one (with new loop bounds and possibly new
predicates guarding the loop body), and the second component, called
statement reordering function, defines the ordering of the statements
in the transformed loop nest. The process of obtaining a loop
transformer, however, suggests another decomposition: a temporal
morphism and a spatial morphism.

3.1  Loop  Transformer  and  Schedule 

Kinds of Index Domains  For the purpose of loop transformation, it
is useful to indicate how the index domain shall be interpreted. We
do this by defining kinds of index domains. The kind of an interval
domain $D$ can be either spatial or temporal. The kind of a product
domain is the product of the kinds of the component domains. For
example, $D_1 \times D_2$ is of kind temporal $\times$ spatial if
$D_1$ is of kind temporal and $D_2$ is of kind spatial. A single-level
loop with a temporal index domain corresponds to a sequential loop
(i.e. DO), while a spatial index domain corresponds to a parallel
loop (i.e. DOALL).

Lexicographical Ordering  We use the following notations to denote
lexicographical ordering on elements $X = (x_1, \ldots, x_n)$ and
$Y = (y_1, \ldots, y_n)$ of an $n$-dimensional index domain. We define
$\prec$ to be the lexicographical ordering: we say $X \prec Y$ if
there exists $k$, $1 \le k \le n$, such that $x_l = y_l$ for all $l$,
$l < k$, and $x_k < y_k$. Similarly, we say $X \preceq Y$ if
$X \prec Y$ or $x_k = y_k$ for all $k$, $1 \le k \le n$. We use
$\mathbf{0}$ to denote the zero vector.

Domain Morphism  We define a domain morphism to be a bijective
function $g$ from index domain $D$ to index domain $E$, denoted by
$g: D \rightarrow E$, such that for all dependences
$S_1@I \Rightarrow S_2@J$, condition $g(J) - g(I) \succeq \mathbf{0}$
holds. In other words, a domain morphism will never reverse the
ordering imposed by dependence relations.

In this paper, we restrict the codomain $E$ of a domain morphism to
be a cross product of a temporal index domain $E_1$ and a spatial
index domain $E_2$, i.e. $E = E_1 \times E_2$. Under this restriction,
all parallel loops are innermost loops in the transformed loop nest.
We define $g_1$ and $g_2$ to be two functions:

$$g_1: D \rightarrow E_1$$

(called a temporal morphism), and

$$g_2: D \rightarrow E_2$$

(called a spatial morphism).

Under domain morphism $g$, index $I$ in the original loop will be
mapped to index $J = g(I)$ in the transformed loop nest. Since $g$ is
bijective, it has a well-defined inverse, denoted by $g^{-1}$.
Clearly, $I = g^{-1}(J)$. The following loop nest

Loop Nest 2

    DO ((I:D)) {
      ... A(X(I)) ... }

will be transformed into the following new loop nest under domain
morphism $g: D \rightarrow E_1 \times E_2$:

Loop Nest 3

    DO ((J_1:E_1)) {
      DOALL ((J_2:E_2)) {
        ... A(X(g^{-1}(J_1:J_2))) ... } }

where $(J_1 : J_2)$ denotes the concatenation of two vectors $J_1$
and $J_2$.

The requirement of $g$ to be surjective is in fact not essential.
For any injective function $g': D \rightarrow E$, we can always derive
a corresponding bijective function
$g: D \rightarrow \{g'(I) \mid I \in D\}$ from $D$ to the image of $D$
under $g'$ [6]. Therefore, by allowing the codomain of a bijective
function to be the image of an injective function, we allow a much
more general class of functions to be used as domain morphisms. For
comparison, the unimodular transformations discussed in [4, 16] are
special classes of bijective functions. The generality does require
some nontrivial algebraic manipulation to generate correct loop
bounds and predicates to guard the conditional statements in the
transformed loop nest. An automatic transformation procedure for
doing this, based on an equational theory, is described in [6].

Statement Reordering  We now discuss statement reordering. Let
$\mathcal{S}$ denote the set of statements in the loop body. We
define a statement reordering to be a function $r$ from the set of
statements to the set of statement labels:

$$r: \mathcal{S} \rightarrow [0, s - 1], \tag{2}$$

where $s = |\mathcal{S}|$, the number of statements in $\mathcal{S}$.

Loop Transformer  With $g$ and $r$ defined above, the following
function $h$, called the loop transformer, specifies how a loop nest
is transformed:

$$h: D \times \mathcal{S} \rightarrow E_1 \times E_2 \times [0, s - 1]$$
$$h(I, S) = (g_1(I), g_2(I), r(S)).$$

Schedule  Given $h$ defined above, a schedule $\pi$ is defined to be
a function

$$\pi: D \times \mathcal{S} \rightarrow E_1 \times [0, s - 1]$$
$$\pi(I, S) = (g_1(I), r(S)), \tag{4}$$

such that condition $\pi(J, S_2) - \pi(I, S_1) \succ \mathbf{0}$ must
hold for all dependences $S_1@I \Rightarrow S_2@J$ in the loop nest.
The condition ensures that the ordering imposed by dependence
relations is preserved. Clearly, a schedule determines the sequential
execution of the transformed parallel loop nest. Note that by the
definition of domain morphism, $g_1(J) - g_1(I)$ can be equal to the
zero vector, i.e. $S_1@I$ and $S_2@J$ can be computed at the same
iteration in the transformed loop nest. In this case, statement $S_1$
must be in front of statement $S_2$ in the loop body, i.e. condition
$r(S_1) < r(S_2)$ must hold, to preserve the dependence ordering.
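The schedule condition can be checked mechanically on a small instance. The sketch below is a brute-force illustration, not the algorithm of [12]: it takes the two-statement nest that appears as Loop Nest 6 in Section 4.3 (S_1: A(i,j) = B(i,j-1) + i; S_2: B(i,j) = A(i-1,j) + j), enumerates its flow dependences for an assumed bound n = 4, and tests $\pi(J, S_2) - \pi(I, S_1) \succ \mathbf{0}$ for two candidate schedules. The helper names are hypothetical.

```python
from itertools import product

def lex_positive(v):
    """True if v is lexicographically greater than the zero vector."""
    return tuple(v) > (0,) * len(v)

def is_valid_schedule(pi, deps):
    """Check pi(sink) - pi(source) > 0 (lexicographically) for every dependence."""
    for (I, s1), (J, s2) in deps:
        diff = [b - a for a, b in zip(pi(I, s1), pi(J, s2))]
        if not lex_positive(diff):
            return False
    return True

n = 4  # assumed loop bound for the illustration
dom = set(product(range(1, n + 1), repeat=2))
deps = []
for (i, j) in sorted(dom):
    # S1@(i,j) reads B(i,j-1), which S2 writes at iteration (i,j-1).
    if (i, j - 1) in dom:
        deps.append((((i, j - 1), "S2"), ((i, j), "S1")))
    # S2@(i,j) reads A(i-1,j), which S1 writes at iteration (i-1,j).
    if (i - 1, j) in dom:
        deps.append((((i - 1, j), "S1"), ((i, j), "S2")))

# Candidate 1: g1(i,j) = i with r(S1) = 1, r(S2) = 0 (statements swapped).
r_swapped = {"S1": 1, "S2": 0}
pi_swapped = lambda I, s: (I[0], r_swapped[s])

# Candidate 2: same g1 but the source statement order r(S1) = 0, r(S2) = 1.
r_source = {"S1": 0, "S2": 1}
pi_source = lambda I, s: (I[0], r_source[s])

print(is_valid_schedule(pi_swapped, deps), is_valid_schedule(pi_source, deps))
```

With the outer index as the only sequential level, the dependence from S_2 at (i, j-1) to S_1 at (i, j) falls in the same time step, so only the candidate that places S_2 before S_1 passes the check.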


3.2  Overall  Procedure  to  Obtain  a 
New  Loop  Nest 

Finding a schedule $\pi$ amounts to determining the potential
parallelism that can be extracted from the source program. The
algorithms for obtaining a schedule $\pi$ are presented in [12]. The
so-called strip mining [17] and tiling [16, 18] of loops are captured
by the spatial morphism $g_2$. Given a schedule $\pi = (g_1, r)$, the
choice of $g_2$, which depends on factors such as memory and
processor organization and communication cost, should keep the loop
transformer $h = (g_1, g_2, r)$ injective. A default $g_2$, which is
used in the rest of this paper, can be
$g_2(i_1, \ldots, i_d) = (i_{p_1}, \ldots, i_{p_n})$, chosen so that
the resulting loop transformer $h$ is injective, where $n$ is the
dimensionality of the spatial index domain $E_2$,
$\{p_1, \ldots, p_n\}$ is a subset of interval domain $[1, d]$, and
$p_1 < \cdots < p_n$.
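The injectivity requirement on $h = (g_1, g_2, r)$ can be illustrated by brute force on a small domain. In the following sketch, the 3-by-3-by-3 domain, the two statement labels, and the helper names are assumptions made for the example; the temporal part $g_1(i, j, k) = (-i, k)$ is the one used by the 2-SL schedule of Example 4. A projection onto $j$ keeps $h$ injective, whereas a projection onto $k$ does not, because $j$ is then lost.

```python
from itertools import product

def default_g2(p):
    """Default spatial morphism: project I onto positions p (1-based, increasing)."""
    return lambda I: tuple(I[k - 1] for k in p)

def is_injective(h, points):
    """Brute-force injectivity check of h over a finite set of points."""
    images = [h(x) for x in points]
    return len(set(images)) == len(images)

# Assumed small 3-level domain and two statements with r(S1) = 0, r(S2) = 1.
D = list(product(range(1, 4), repeat=3))
r = {"S1": 0, "S2": 1}
points = [(I, S) for I in D for S in r]

# Temporal morphism g1(i, j, k) = (-i, k).
g1 = lambda I: (-I[0], I[2])

def transformer(g2):
    """Build h(I, S) = (g1(I), g2(I), r(S)) for a given spatial morphism."""
    return lambda x: (g1(x[0]), g2(x[0]), r[x[1]])

h_good = transformer(default_g2([2]))  # spatial index j
h_bad = transformer(default_g2([3]))   # spatial index k repeats g1's k

print(is_injective(h_good, points), is_injective(h_bad, points))
```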

Overall Procedure  To summarize, the overall procedure to obtain a
new loop nest is:

1. First generate a schedule $\pi = (g_1, r)$ to maximize the degree
   of parallelism by using the algorithms presented in [12].

2. Then determine the spatial morphism $g_2$ of the domain morphism
   based on target machine characteristics such as memory and
   processor organization, communication cost, etc., or use a default
   function as shown above.

3. The loop transformer is simply $h = (g_1, g_2, r)$.

4. Finally perform symbolic program transformation, given the source
   loop nest and loop transformer $h$, to obtain the new loop nest.
   For the formal procedure, please refer to [6].
We now discuss different classes of schedules, one of which contains
the existing schedules.

4  Classes of Affine and Piece-Wise Affine Schedules

We call a schedule affine if it is an affine function of the loop
indices. We call a schedule piece-wise affine if the restriction of
the function to each subdomain of $D$ and each subset of
$\mathcal{S}$ is affine. In the loop restructuring literature, only
affine schedules are considered. In this paper, we consider, in
addition, piece-wise affine schedules.

We now classify schedules according to two properties: (1) the
uniformity of the schedule with respect to the set of statements
$\mathcal{S}$ and the index domain $D$, and (2) the degree of
parallelism in the transformed loop nest.

4.1  Properties  of  Schedules 

Uniformity  Let index domain $D$ be partitioned into $m$ disjoint
subdomains $D_k$, $1 \le k \le m$; and let the set of statements
$\mathcal{S}$ be partitioned into $n$ disjoint subsets
$\mathcal{S}_k$, $1 \le k \le n$. The general form of a piece-wise
affine schedule $\pi$ defined in Equation (4) consists of conditional
branches, one for each pair of subdomain $D_i$ and statement subset
$\mathcal{S}_j$, with an affine expression of the loop indices on the
right-hand side of each branch. We call a schedule

1. uniform if $m = 1$ and $n = 1$,

2. subdomain-variant if $m > 1$ and $n = 1$ (also called a subdomain
   schedule),

3. statement-variant if $m = 1$ and $n > 1$, or

4. nonuniform if $m > 1$ and $n > 1$.

Degree of Generated Parallelism  As defined in Equations (1) and
(4), the dimensionality of $E_1$, the temporal index domain,
indicates the number of levels of sequential loops in the transformed
loop nest. Hence a schedule $\pi$ would generate a target loop nest
with more levels of parallel loops, and thus potentially more
parallelism, if $E_1$ is of lower dimensionality. We call the
dimensionality of $E_1$ the sequential level of $\pi$. Schedules can
thus be classified as:

1. Single-sequential level schedule (SSL) if $E_1$ is a subset of the
   set of natural numbers $\mathcal{N}$.

2. Multiple-sequential level schedule (MSL) if $E_1$ is a subset of
   $\mathcal{N}^n$, where $n$ is a positive integer and $n < d$, the
   dimensionality of the original loop nest.

3. Mixed schedule (Mixed) if $E_1$ can be of different dimensions for
   each pair of subdomain $D_i$ and statement subset $\mathcal{S}_j$.
   Such a mixed schedule will result in transformed programs
   consisting of imperfectly nested loops.

4.2  Classification  and  Functional 
Form  of  Schedules 

Classification  Clearly, the uniformity of $\pi$ and the
dimensionality of $\pi$ are two orthogonal properties, except that a
mixed schedule cannot be uniform. Thus there are altogether eleven
($4 \times 3 - 1$) classes of affine and piece-wise affine schedules.
The classes and their acronyms, ranging from single-sequential level
uniform schedules to mixed nonuniform schedules, are in Figure 1.
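The count of eleven classes follows directly from the two properties. As a quick check, the short Python sketch below enumerates every pairing of sequential level and uniformity and drops the single excluded combination; the list names are chosen for this illustration only.

```python
from itertools import product

UNIFORMITY = ["uniform", "subdomain-variant", "statement-variant", "nonuniform"]
LEVELS = ["SSL", "MSL", "Mixed"]

# Every (level, uniformity) pair except "Mixed uniform", which cannot occur:
# a mixed schedule needs at least two branches of different dimensionality.
classes = [(lvl, uni) for lvl, uni in product(LEVELS, UNIFORMITY)
           if not (lvl == "Mixed" and uni == "uniform")]

for lvl, uni in classes:
    print(lvl, uni)
print(len(classes))
```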

Functional Form  We now describe the forms of affine and piece-wise
affine schedules by using matrix and vector notations. Let $r(S)$ for
a given $S$ in $\mathcal{S}$ be a constant scalar. Let $d$ be the
dimensionality of the index domain of the source loop nest.

Uniform Schedule:

$$\pi(I, S) = (T I, r(S)), \qquad I \in D,\ S \in \mathcal{S},$$

where $T$ is a constant $l$-by-$d$ matrix and $l$ is the sequential
level of the schedule $\pi$.

Subdomain Schedule:

$$\pi(I, S) = \begin{cases} I \in D_1 \rightarrow (T_1 I, r_1(S)) \\ \quad\cdots \\ I \in D_m \rightarrow (T_m I, r_m(S)) \end{cases} \qquad I \in D,\ S \in \mathcal{S},$$

where $T_i$, $1 \le i \le m$, is a constant $l_i$-by-$d$ matrix and
$l_i$ is the sequential level of the part of the schedule defined
over $D_i$.
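These functional forms translate directly into executable form. The sketch below is illustrative only: the helper names are hypothetical, the pair $(T I, r(S))$ is flattened into a single tuple so that it can be compared lexicographically, and $r$ is assumed to be the 0-based position of the statement. The uniform instance uses $T$ with rows $(-1, 0, 0)$ and $(0, 0, 1)$, i.e. $T I = (-i, k)$; the subdomain instance uses the two branches $-2i + j + k$ and $-i + 2j - k$ that appear later in Example 6, with the boundary case $i + j - 2k = 0$ assigned to the second branch as an assumption.

```python
def matvec(T, I):
    """Multiply matrix T (a list of rows) by index vector I."""
    return tuple(sum(t * x for t, x in zip(row, I)) for row in T)

def uniform_schedule(T, r):
    """pi(I, S) = (T I, r(S)), flattened into one tuple."""
    return lambda I, S: matvec(T, I) + (r[S],)

def subdomain_schedule(branches):
    """Piece-wise affine schedule; branches is a list of (predicate, T_i, r_i)."""
    def pi(I, S):
        for in_subdomain, T, r in branches:
            if in_subdomain(I):
                return matvec(T, I) + (r[S],)
        raise ValueError("I lies outside every subdomain")
    return pi

# Assumed 0-based statement positions for a three-statement body.
loc = {"S1": 0, "S2": 1, "S3": 2}

# Uniform, sequential level 2: T I = (-i, k).
pi_uniform = uniform_schedule([[-1, 0, 0], [0, 0, 1]], loc)

# Subdomain-variant, sequential level 1, two subdomains split by i + j - 2k.
pi_subdomain = subdomain_schedule([
    (lambda I: I[0] + I[1] - 2 * I[2] < 0, [[-2, 1, 1]], loc),
    (lambda I: I[0] + I[1] - 2 * I[2] >= 0, [[-1, 2, -1]], loc),
])

print(pi_uniform((2, 3, 2), "S1"))
print(pi_subdomain((1, 3, 3), "S3"))
print(pi_subdomain((2, 3, 2), "S1"))
```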
Figure 1: Classes of schedules


Statement-Variant Schedule:

$$\pi(I, S) = \begin{cases} S \in \mathcal{S}_1 \rightarrow (T_1 I, r(S)) \\ \quad\cdots \\ S \in \mathcal{S}_n \rightarrow (T_n I, r(S)) \end{cases} \qquad I \in D,\ S \in \mathcal{S},$$

where $T_i$, $1 \le i \le n$, is a constant $l_i$-by-$d$ matrix and
$l_i$ is the sequential level of the part of the schedule defined
over $\mathcal{S}_i$.

Nonuniform Schedule:

$$\pi(I, S) = \begin{cases} I \in D_1,\ S \in \mathcal{S}_1 \rightarrow (T_{11} I, r_1(S)) \\ \quad\cdots \\ I \in D_m,\ S \in \mathcal{S}_n \rightarrow (T_{mn} I, r_m(S)) \end{cases} \qquad I \in D,\ S \in \mathcal{S}, \tag{8}$$

where $T_{ij}$, $1 \le i \le m$ and $1 \le j \le n$, is a constant
$l_{ij}$-by-$d$ matrix and $l_{ij}$ is the sequential level of the
part of the schedule defined over $D_i$ and $\mathcal{S}_j$.

The linear term $T I$, $I \in D$, determines the form of the
sequential loops in the transformed loop nest, which includes nesting
structures, bounds, and possibly additional predicates to guard the
loop body. The constant terms $r(S)$ determine the order of the
statements in the transformed loop body.

4.3  Examples of Different Classes of Schedules

We now give some examples of different classes of schedules. We
first show that loop interchanging, permutation and skewing are
special cases of MSL uniform schedules.

Example 1: Loop Interchanging and Permutation  Loop interchanging
and loop permutation [1, 2, 3, 17, 19] is a process of switching
inner and outer loops. Suppose Loop Nest 1 after loop interchanging
or loop permutation becomes

Loop Nest 4

    DO (i_{p_1} = l_{p_1}, u_{p_1}) {
      DO (...) {
        DO (i_{p_d} = l_{p_d}, u_{p_d}) {
          body } } },

where $(p_1, p_2, \ldots, p_d)$ is a permutation of
$(1, 2, \ldots, d)$. Also suppose the $m$ innermost loops are
parallelizable. The schedule $\pi$ has the form:

$$\pi(I, S) = (i_{p_1}, i_{p_2}, \ldots, i_{p_{d-m}}, \mathrm{loc}(S)), \tag{9}$$

$$\text{i.e.}\quad T = \begin{pmatrix} V(p_1) \\ \vdots \\ V(p_{d-m}) \end{pmatrix}, \text{ and} \tag{10}$$

$$r(S) = \mathrm{loc}(S), \tag{11}$$

where loc is a function from $\mathcal{S}$ to $\mathcal{N}$ that
returns the position of the statement $S$ in the source loop nest,
and each $V(k)$ is a vector of length $d$ with the $k$-th element
being 1 and all other elements being 0.
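The matrix of Example 1 is simple to construct explicitly. The following sketch builds the rows $V(p_1), \ldots, V(p_{d-m})$ and applies them to an index vector; the function names and the concrete permutation $(3, 1, 2)$ with $m = 1$ are assumptions chosen for illustration.

```python
def unit_row(k, d):
    """V(k): length-d vector with 1 at position k (1-based) and 0 elsewhere."""
    return [1 if pos == k else 0 for pos in range(1, d + 1)]

def permutation_T(p, m):
    """Rows V(p_1), ..., V(p_{d-m}) for permutation p with m parallel inner loops."""
    d = len(p)
    return [unit_row(p[k], d) for k in range(d - m)]

def apply(T, I):
    """Compute the linear term T I as a tuple."""
    return tuple(sum(t * x for t, x in zip(row, I)) for row in T)

# Hypothetical case: d = 3, loops reordered as (i_3, i_1, i_2), and the
# innermost loop (over i_2) is parallel, so m = 1 and T has two rows.
T = permutation_T((3, 1, 2), 1)
print(T)
print(apply(T, (10, 20, 30)))
```

The linear term selects exactly the indices of the sequential loops in their new nesting order, which is what Equation (9) states.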

Example 2: Loop Skewing  This operation transforms Loop Nest 1 as
follows: shifting index $i_n$ with respect to index $i_m$,
$1 \le m < n \le d$, by a factor of $f$, where $f$ is a positive
integer, replacing $l_n$ with the expression $(l_n + i_m * f)$,
replacing $u_n$ with the expression $(u_n + i_m * f)$, and replacing
all occurrences of $i_n$ in the loop with the expression
$(i_n - i_m * f)$ [17, 19]. The transformed loop nest is of the form:

Loop Nest 5

    DO (i_1 = l_1, u_1) {
      ...
      DO (i_n = l_n + i_m * f, u_n + i_m * f) {
        ...
        DO (i_d = l_d, u_d) {
          loop body with i_n being
          replaced by (i_n - i_m * f) } } }

The schedule for loop skewing is of the form:

$$\pi(I, S) = (i_1, \ldots, i_m, \ldots, \underbrace{i_n + f * i_m}_{n\text{-th element}}, \ldots, i_d, \mathrm{loc}(S)), \tag{12}$$

$$\text{i.e.}\quad T = \begin{pmatrix} V(1) \\ \vdots \\ V(n) + f * V(m) \\ \vdots \\ V(d) \end{pmatrix}, \text{ and} \tag{13}$$

$$r(S) = \mathrm{loc}(S), \tag{14}$$

where loc(S) and $V(k)$ are the same as defined in Example 1.

Example 3: SSL Uniform Schedule

Loop Nest 6

    DO (i = 1, n) {
      DO (j = 1, n) {
        S_1: A(i,j) = B(i,j-1) + i
        S_2: B(i,j) = A(i-1,j) + j } }

An SSL uniform schedule

$$\pi((i,j), S_1) = (i, 1), \text{ and } \pi((i,j), S_2) = (i, 0), \tag{15}$$

will transform Loop Nest 6 into

Loop Nest 7

    DO (i = 1, n) {
      DOALL (j = 1, n) {
        S_2: B(i,j) = A(i-1,j) + j
        S_1: A(i,j) = B(i,j-1) + i } }

Example 4: MSL Uniform Schedule

Loop Nest 8

    DO (i = n-1, 1, -1) {
      DO (j = i+1, n) {
        DO (k = i, j) {
          S_1: IF (i + 1 = k)
                 B(i,j,k) = C(i+1,j,j)
          S_2: IF (i + 1 < k)
                 B(i,j,k) = B(i+1,j,k)
          S_3: IF (i + j + 1 < 2k)
                 C(i,j,k) = C(i,j,k-1) + B(i,j,k) } } }

A 2-SL uniform schedule

$$\pi((i,j,k), S) = ((-i, k), \mathrm{loc}(S)) \tag{16}$$

will transform Loop Nest 8 into Loop Nest 9.

Example 5: Mixed Statement-Variant Schedule  Consider Loop Nest 8
again. The following schedule transforms Loop Nest 8 to Loop Nest 10,
which consists of imperfectly nested loops:

$$\pi((i,j,k), S) = \begin{cases} S = S_3 \rightarrow ((-i, k), \mathrm{loc}(S)) \\ \text{else} \rightarrow (-i, \mathrm{loc}(S)) \end{cases} \tag{17}$$

Example 6: SSL Subdomain Schedule

Another possible transformation of Loop Nest 8
Loop Nest 9

    DO (i = 1-n, -1) {
      DO (k = -i, n) {
        DOALL (j = 1-i, n) {
          S_1: IF ((-i + 1 = k) ∧ (k ≤ j)) B(-i,j,k) = C(1-i,j,j)
          S_2: IF ((-i + 1 < k) ∧ (k ≤ j)) B(-i,j,k) = B(1-i,j,k)
          S_3: IF ((-i + j + 1 < 2k) ∧ (k ≤ j))
                 C(-i,j,k) = C(-i,j,k-1) + B(-i,j,k) } } }

Loop Nest 10

    DO (i = 1-n, -1) {
      DOALL ((j = 1-i, n), (k = -i, n)) {
        S_1: IF ((-i + 1 = k) ∧ (k ≤ j)) B(-i,j,k) = C(1-i,j,j)
        S_2: IF ((-i + 1 < k) ∧ (k ≤ j)) B(-i,j,k) = B(1-i,j,k) }
      DO (k = -i, n) {
        DOALL (j = 1-i, n) {
          S_3: IF ((-i + j + 1 < 2k) ∧ (k ≤ j))
                 C(-i,j,k) = C(-i,j,k-1) + B(-i,j,k) } } }

Loop Nest 11

    DO (t = 2, 2n-2) {
      DOALL (i = n-1, 1, -1) {
        DOALL (j = i+1, n) {
          S_11: IF ((2t + 3i - 3j > 0) ∧ (t + i - j - 1 = 0))
                  B(i,j,t+2i-j) = C(i+1,j,j)
          S_12: IF ((2t + 3i - 3j ≥ 0) ∧ (t + 2i - 2j + 1 = 0))
                  B(i,j,-t-i+2j) = C(i+1,j,j)
          S_21: IF ((2t + 3i - 3j > 0) ∧ (t + i - j - 1 > 0))
                  B(i,j,t+2i-j) = B(i+1,j,t+2i-j)
          S_22: IF ((2t + 3i - 3j ≥ 0) ∧ (t + 2i - 2j + 1 < 0))
                  B(i,j,-t-i+2j) = B(i+1,j,-t-i+2j)
          S_31: IF ((2t + 3i - 3j > 0) ∧ (2t + 3i - 3j - 1 > 0))
                  C(i,j,t+2i-j) = C(i,j,t+2i-j-1) + B(i,j,t+2i-j)
          S_32: IF ((2t + 3i - 3j ≥ 0) ∧ (2t + 3i - 3j + 1 < 0))
                  C(i,j,-t-i+2j) = C(i,j,-t-i+2j-1) + B(i,j,-t-i+2j) } } }
is the schedule:

$$\pi((i,j,k), S) = \begin{cases} i + j - 2k < 0 \rightarrow (-2i + j + k, \mathrm{loc}(S)) \\ i + j - 2k \ge 0 \rightarrow (-i + 2j - k, \mathrm{loc}(S)) \end{cases} \tag{18}$$

which transforms Loop Nest 8 into Loop Nest 11. Since there are two
affine functions for disjoint subdomains of the index domain of the
loop nest, each statement in Loop Nest 8 results in two guarded
statements in the transformed loop nest. In fact Loop Nest 8 is part
of the dynamic programming code presented in Section 5. As one can
see, an SSL subdomain schedule can result in code of considerable
complexity. It would be a very tedious and error-prone process for a
user to write the code by hand. But a compiler can generate the new
loop nest mechanically, given the schedule and the original loop
nest.

5  An Application: Dynamic Programming

To illustrate the usefulness of the more general schedules, we take
dynamic programming as an example, which has sequential complexity
$O(n^3)$ for a problem of size $n$. The source code is given in Loop
Nest 12.

Loop Nest 12

    DO (i = 1, n-2) {
      DO (j = i+2, n) {
        C(i,j) = min_{i<k<j} (h(C(i,k), C(k,j))) } }

This source program is first transformed in a systematic manner by
applying fan-in and fan-out reductions [5] to reduce potential
concurrent accesses of variables. The result is Loop Nest 13. Then
the code is transformed into three *lisp programs on the Connection
Machine CM/2, each with the control structure generated by a 2-SL
uniform schedule, a mixed statement-variant schedule and an SSL
subdomain schedule respectively. We also have a sequential
Common-Lisp program on the Symbolics to compute the same problem. The
three schedules are given below. For simplicity, we do not give the
constant terms $r(S)$ of function $\pi$.

Loop Nest 13

    DO (i = n-1, 1, -1) {
      DO (j = i+1, n) {
        m = (i + j + 1)/2
        DO (k = i, j) {
          S_A1: IF (k < j) A(i,j,k) = A(i,j-1,k)
          S_B1: IF (i + 1 = k) B(i,j,k) = C(i+1,j,j)
          S_B2: IF (i + 1 < k) B(i,j,k) = B(i+1,j,k)
          S_C1: IF (m = k) C(i,j,k) = h_1(A(i,j,k), B(i,j,k),
                                          A(i,j,i+j-k), B(i,j,i+j-k))
          S_C2: IF (m < k < j) C(i,j,k) = h_2(C(i,j,k-1), A(i,j,k),
                                  B(i,j,k), A(i,j,i+j-k), B(i,j,i+j-k))
          S_C3: IF (k = j) C(i,j,k) = C(i,j,k-1)
          S_A2: IF (k = j) A(i,j,k) = C(i,j,k) } } }

2-SL uniform schedule:

$$\pi(S, (i,j,k)) = (j - i, k - i) \tag{19}$$

mixed statement-variant schedule:

$$\pi(S, (i,j,k)) = \begin{cases} S = S_{C2} \rightarrow (j - i, k - i) \\ \text{else} \rightarrow j - i \end{cases} \tag{20}$$

SSL subdomain schedule:

$$\pi(S, (i,j,k)) = \begin{cases} i + j - 2k < 0 \rightarrow -2i + j + k \\ i + j - 2k \ge 0 \rightarrow -i + 2j - k \end{cases} \tag{21}$$

Experimental Result  The experiment is conducted as follows: we run
the sequential code on the Symbolics and parallel codes on an
8K-processor Connection Machine with Symbolics as its host. The
results described in Figure 2 and Figure 3 show that the version
using an SSL subdomain schedule is three orders of magnitude faster
than the sequential code, and is two orders of magnitude faster than
the versions using a 2-SL uniform schedule and mixed
statement-variant schedule. And the program using a mixed
statement-variant schedule is about three to four times faster than
the program using a 2-SL uniform schedule.

      n   3-SL sequential   2-SL uniform   mixed statement-variant   SSL subdomain
     32               6.8          10.72                      2.47            0.87
     64              55.0          42.88                      9.73            1.73
    128             440.0         171.50                     39.16            3.48
    256            3520.0         686.45                    235.70            6.96
    512           28160.0        2745.80                   1159.24           31.70

Figure 2: Running time in seconds.

Figure 3: Running time vs. problem size.

6  Concluding Remarks

We present in this paper a formal mathematical framework which
unifies the existing loop transformations. We also present more
general affine and piece-wise affine schedules which can extract more
parallelism from a class of programs than the existing techniques.
The particular class of programs are those that consist of perfectly
nested loops, possibly with conditional statements, where the guards
as well as the array index expressions are affine expressions of the
loop indices. Although the complexity for obtaining these more
general schedules is high [12], we show that the generated code
derived from a new schedule can be two orders of magnitude faster
than the version from the existing transformations. For programs not
in this particular class, e.g. programs with pointers, compiler
directives can be added into the sequential programs to help the
compiler to generate efficient parallel codes.